---
title: 'Air Quality Prediction'
output:
flexdashboard::flex_dashboard:
orientation: columns
vertical_layout: fill
theme: flatly
source_code: embed
---
```{r setup, include=FALSE}
library(flexdashboard)
library(tidyverse)
library(htmltab)
library(tsbox)
library(stR)
library(quantmod)
library(tseries)
library(zoo)  # na.approx() used below comes from zoo
library(forecast)
library(dygraphs)
library(leaflet)
library(ggplot2)
library(plotly)
library(readxl)
data <- read.csv('~/R/AQI prediction/data2.csv')
data$Date<-as.Date(data$Date)
AQI <- read.csv('~/R/AQI prediction/Data/city_day.csv')
AQID<-dplyr::filter(AQI,City=="Delhi")
AQIM<-dplyr::filter(AQI,City=="Mumbai")
AQIC<-dplyr::filter(AQI,City=="Chennai")
AQID<-AQID[,c("Date","AQI")]
AQID$Date<-as.Date(AQID$Date,format="%d-%m-%Y")
AQIM<-AQIM[,c("Date","AQI")]
AQIM$Date<-as.Date(AQIM$Date,format="%d-%m-%Y")
AQIC<-AQIC[,c("Date","AQI")]
AQIC$Date<-as.Date(AQIC$Date,format="%d-%m-%Y")
AQID.ts<-ts(AQID$AQI,start=c(2015,1),frequency=365)
AQID.ts<-na.approx(AQID.ts)
AQIM.ts<-ts(AQIM$AQI,start=c(2015,1),frequency=365)
AQIM.ts<-na.approx(AQIM.ts)
AQIC.ts<-ts(AQIC$AQI,start=c(2015,1),frequency=365)
AQIC.ts<-na.approx(AQIC.ts)
```
Submitted to Prof. Sayantan Banerjee for the course Statistical Applications of R 2020, IIM Indore.
```{r , include=FALSE}
worldPM1025 <- read_excel("aap_air_quality_database_2018_v14.xlsx", sheet = "database",
col_types = c("text","skip", "text", "skip", "skip", "text", "skip", "skip",
"text", "skip", "skip", "skip", "skip", "skip", "skip"), skip = 2)
#Source: WHO Global Ambient Air Quality Database - update 2018
#Preprocessing
names(worldPM1025)<-c("Region", "Country","PM10","PM2.5")
worldPM1025$PM10<-stringr::str_replace(worldPM1025$PM10,"-converted value","")
worldPM1025$PM10<-stringr::str_replace(worldPM1025$PM10,"[)]","")
worldPM1025$PM10<-stringr::str_replace(worldPM1025$PM10,"[(]","")
worldPM1025$PM2.5<-stringr::str_replace(worldPM1025$PM2.5,"-converted value","")
worldPM1025$PM2.5<-stringr::str_replace(worldPM1025$PM2.5,"[)]","")
worldPM1025$PM2.5<-stringr::str_replace(worldPM1025$PM2.5,"[(]","")
worldPM1025$PM2.5<-as.numeric(worldPM1025$PM2.5)
worldPM1025$PM10<-as.numeric(worldPM1025$PM10)
World2.5<-aggregate(x=worldPM1025$PM2.5,by=list(Country=worldPM1025$Country,Region=worldPM1025$Region),FUN = mean)
World10<-aggregate(x=worldPM1025$PM10,by=list(Country=worldPM1025$Country,Region=worldPM1025$Region),FUN = mean)
#Source:https://www.kaggle.com/paultimothymooney/latitude-and-longitude-for-every-country-and-state
countries <- read.csv("~/countries.csv")
names(countries)<-c("code","lat","lon","Country")
mergeddata2.5<-merge(World2.5,countries,type="left")
mergeddata10<-merge(World10,countries,type="left")
# Create a palette that maps factor levels to colors
pal <- colorNumeric(palette = "YlOrRd", domain = mergeddata2.5$x)
pal2 <- colorNumeric(palette = "YlOrRd", domain = mergeddata10$x)
```
```{r, include = FALSE}
o=data.frame(na.omit(AQI[,c(1,5:9)]))
fin<-aggregate(o,by=list(o$City),FUN=mean)
fin<-na.omit(fin[,-2])
names(fin)=c("city","NO","NO2","NOx","NH3","CO")
indlatlon <- read.csv("~/in.csv")
indiaf<-merge(fin,indlatlon,type="inner",by="city")
```
Overview {data-navmenu="Intro"}
=====================================
Column
-------------------------------------
### Overview
The Problem
Climate change is one of the biggest problems of our time. India, the second most populous country in the world, contributes significantly to air pollution, with many of its cities among the 50 most polluted in the world. Yet the extent of pollution is not properly studied or used in decision-making in India: air quality monitoring stations exist in many cities, but their data is mostly treated as an observation rather than used for prediction, and such stations are absent from smaller cities. This project aims to build a forecasting model for AQI, focusing on Indian cities like Delhi, Mumbai and Chennai, which are infamous for their pollution levels.
### India NOx pollution
```{r}
indiafplot3<-data.frame(lat=indiaf$lat,lon=indiaf$lng,x=indiaf$NOx)
pal5 <- colorNumeric(palette = "YlOrRd", domain = indiafplot3$x)
leaflet(indiafplot3) %>% addTiles() %>%
addCircleMarkers(
radius = ~x,
color = ~pal5(x),
stroke = FALSE, fillOpacity = 0.5
)
```
Column
-------------------------------------
### Pollution level by continents and income (PM 10)
```{r}
library(plotly)
fig <- plot_ly(World10, y = ~x, color = ~Region, type = "box")
fig
```
### Pollution levels in different countries (PM 10)
```{r}
leaflet(mergeddata10) %>% addTiles() %>%
addCircleMarkers(
radius = ~x/5,
color = ~pal2(x),
stroke = FALSE, fillOpacity = 0.5
)
```
Data Summary {data-navmenu="Intro"}
=====================================
Column
-------------------------------------
### Meteorological data
Meteorological data
The meteorological data was scraped from tutiempo.net; the code for the scraper is provided with this file. The values collected were temperature, pressure, humidity and wind speed.
### AQI Data
AQI Data
The AQI data was downloaded from Kaggle. The data set has AQI along with CO, NO, NOx, NH3, SO2, PM2.5 and PM10 concentrations measured in micrograms per cubic metre.
### World pollution Data
World pollution Data
This data was downloaded from the WHO Global Ambient Air Quality Database (update 2018). It has location-wise pollution data for countries.
Column
-------------------------------------
### India pollution Data
India pollution Data
The AQI data was manipulated and grouped to obtain mean pollution across various cities in India.
### Latitude Longitude countries
Latitude Longitude countries
This data was downloaded from Kaggle. Since Google's ggmap is no longer free, this was a better option than buying an API key.
### Latitude Longitude Indian cities
Latitude Longitude Indian cities
This data has the locations (geocodes) of different Indian cities, which are matched with the existing pollution data for the visualizations.
World Pollution Levels {data-navmenu="EDA"}
=========================================
Column
-------------------------------------
### Plot Inference
Pollution Levels around the world
The plots alongside reveal how developing regions (low-income Mediterranean, South-East Asia) contribute most to particulate pollution. The other plots show how India compares to the rest of the world.
LMIC: Low- and Middle-Income Countries
HIC: High-Income Countries
### PM 2.5 across countries mapped
```{r}
leaflet(mergeddata2.5) %>% addTiles() %>%
addCircleMarkers(
radius = ~x/5,
color = ~pal(x),
stroke = FALSE, fillOpacity = 0.5
)
```
### PM 10 across countries mapped
```{r}
leaflet(mergeddata10) %>% addTiles() %>%
addCircleMarkers(
radius = ~x/5,
color = ~pal2(x),
stroke = FALSE, fillOpacity = 0.5
)
```
Column
-------------------------------------
### PM 10 across continents
```{r}
library(plotly)
fig <- plot_ly(World10, y = ~x, color = ~Region, type = "box")
fig
```
### PM 2.5 across continents
```{r}
library(plotly)
fig1 <- plot_ly(World2.5, y = ~x, color = ~Region, type = "box")
fig1
```
India Pollution Levels{ data-navmenu="EDA"}
=========================================
Column
-------------------------------------
### Plot Inference
Pollution Levels in India
The plots alongside show the levels of various pollutants and their spread across cities in India (particulate matter is not shown here). These pollutants make up a significant portion of the Air Quality Index (AQI), a widely used measure, which we also use in this project.
### CO
```{r}
indiafplot1<-data.frame(lat=indiaf$lat,lon=indiaf$lng,x=indiaf$CO)
pal3 <- colorNumeric(palette = "YlOrRd", domain = indiafplot1$x)
leaflet(indiafplot1) %>% addTiles() %>%
addCircleMarkers(
radius = ~x*15,
color = ~pal3(x),
stroke = FALSE, fillOpacity = 0.5
)
```
### NO2
```{r}
indiafplot2<-data.frame(lat=indiaf$lat,lon=indiaf$lng,x=indiaf$NO2)
pal4 <- colorNumeric(palette = "YlOrRd", domain = indiafplot2$x)
leaflet(indiafplot2) %>% addTiles() %>%
addCircleMarkers(
radius = ~x,
color = ~pal4(x),
stroke = FALSE, fillOpacity = 0.5
)
```
Column
-------------------------------------
### NOx
```{r}
indiafplot3<-data.frame(lat=indiaf$lat,lon=indiaf$lng,x=indiaf$NOx)
pal5 <- colorNumeric(palette = "YlOrRd", domain = indiafplot3$x)
leaflet(indiafplot3) %>% addTiles() %>%
addCircleMarkers(
radius = ~x,
color = ~pal5(x),
stroke = FALSE, fillOpacity = 0.5
)
```
### NO
```{r}
indiafplot4<-data.frame(lat=indiaf$lat,lon=indiaf$lng,x=indiaf$NO)
pal6 <- colorNumeric(palette = "YlOrRd", domain = indiafplot4$x)
leaflet(indiafplot4) %>% addTiles() %>%
addCircleMarkers(
radius = ~x/2,
color = ~pal6(x),
stroke = FALSE, fillOpacity = 0.5
)
```
### NH3
```{r}
indiafplot5<-data.frame(lat=indiaf$lat,lon=indiaf$lng,x=indiaf$NH3)
pal7 <- colorNumeric(palette = "YlOrRd", domain = indiafplot5$x)
leaflet(indiafplot5) %>% addTiles() %>%
addCircleMarkers(
radius = ~x/2,
color = ~pal7(x),
stroke = FALSE, fillOpacity = 0.5
)
```
Delhi{data-navmenu="Meteorological Trends"}
=========================================
Column
-------------------------------------
### Delhi
```{r}
m <- leaflet() %>%
addTiles() %>%
addMarkers(lng=77.1025, lat=28.7041, popup="Delhi")
m
```
### Temperature
```{r}
datadelhi <- read.csv("datadelhi.csv")
datadelhi$Date<-as.Date(datadelhi$Date)
x <- datadelhi$Date
y<-datadelhi$temp
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("days from today"))
fig
```
Column
-------------------------------------
### Pressure
```{r}
x <- datadelhi$Date
y<-datadelhi$pressure
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("days from today"))
fig
```
### Humidity
```{r}
x <- datadelhi$Date[datadelhi$humidity<1000]
y<-datadelhi$humidity[datadelhi$humidity<1000]
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("days from today"))
fig
```
### Wind
```{r}
x <- datadelhi$Date
y<-datadelhi$wind
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("days from today"))
fig
```
Mumbai{data-navmenu="Meteorological Trends"}
=========================================
Column
-------------------------------------
### Mumbai
```{r}
m <- leaflet() %>%
addTiles() %>%
addMarkers(lng=72.8777, lat=19.0760, popup="Mumbai")
m
```
### Temperature
```{r}
datamumbai <- read.csv("datamumbai.csv")
datamumbai$Date<-as.Date(datamumbai$Date)
x <- datamumbai$Date
y<-datamumbai$temp
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("days from today"))
fig
```
Column
-------------------------------------
### Pressure
```{r}
x <- datamumbai$Date[datamumbai$pressure<1100]
y<-datamumbai$pressure[datamumbai$pressure<1100]
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("days from today"))
fig
```
### Humidity
```{r}
x <- datamumbai$Date
y<-datamumbai$humidity
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("days from today"))
fig
```
### Wind
```{r}
x <- datamumbai$Date
y<-datamumbai$wind
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("days from today"))
fig
```
Chennai{data-navmenu="Meteorological Trends"}
=========================================
Column
-------------------------------------
### Chennai
```{r}
m <- leaflet() %>%
addTiles() %>%
addMarkers(lng=80.2707, lat=13.0827, popup="Chennai")
m
```
### Temperature
```{r}
datachennai <- read.csv("datachennai.csv")
datachennai$Date<-as.Date(datachennai$Date)
x <- datachennai$Date
y<-datachennai$temp
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("days from today"))
fig
```
Column
-------------------------------------
### Pressure
```{r}
x <- datachennai$Date
y<-datachennai$pressure
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("days from today"))
fig
```
### Humidity
```{r}
x <- datachennai$Date
y<-datachennai$humidity
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("days from today"))
fig
```
### Wind
```{r}
x <- datachennai$Date[datachennai$wind<22 ]
y<-datachennai$wind[datachennai$wind<22 ]
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("days from today"))
fig
```
Methodology {data-navmenu="Time Series Analysis"}
=========================================
Column
-------------------------------------
### Meteorological data
Meteorological data
The meteorological data was scraped from tutiempo.net; the code for the scraper is provided with this file. The values collected were temperature, pressure, humidity and wind speed.
### AQI Data
AQI Data
The AQI data was downloaded from Kaggle. The data set has AQI along with CO, NO, NOx, NH3, SO2, PM2.5 and PM10 concentrations measured in micrograms per cubic metre.
### World pollution Data
World pollution Data
This data was downloaded from the WHO Global Ambient Air Quality Database (update 2018). It has location-wise pollution data for countries.
Column
-------------------------------------
### India pollution Data
India pollution Data
The AQI data was manipulated and grouped to obtain mean pollution across various cities in India.
### Latitude Longitude countries
Latitude Longitude countries
This data was downloaded from Kaggle. Since Google's ggmap is no longer free, this was a better option than buying an API key.
### Latitude Longitude Indian cities
Latitude Longitude Indian cities
This data has the locations (geocodes) of different Indian cities, which are matched with the existing pollution data for the visualizations.
ACF/PACF {data-navmenu="Time Series Analysis"}
=========================================
### Importance {.tabset}
-------------------------------------
### ACF Delhi
```{r}
conf.level <- 0.95
ciline <- qnorm((1 - conf.level)/2)/sqrt(length(AQID.ts))
bacf <- acf(AQID.ts, plot = FALSE)   # acf() here, matching the panel title
bacfdf <- with(bacf, data.frame(lag, acf))
q <- ggplot(data=bacfdf, mapping=aes(x=lag, y=acf)) +
geom_bar(stat = "identity", position = "identity") +
geom_hline(yintercept = c(ciline, -ciline), linetype = "dashed", colour = "blue")
q
```
### PACF Delhi
```{r}
conf.level <- 0.95
ciline <- qnorm((1 - conf.level)/2)/sqrt(length(AQID.ts))
bacf <- pacf(AQID.ts, plot = FALSE)   # pacf() here, matching the panel title
bacfdf <- with(bacf, data.frame(lag, acf))
q <- ggplot(data=bacfdf, mapping=aes(x=lag, y=acf)) +
geom_bar(stat = "identity", position = "identity") +
geom_hline(yintercept = c(ciline, -ciline), linetype = "dashed", colour = "blue")
q
```
### ACF Mumbai
```{r}
conf.level <- 0.95
ciline <- qnorm((1 - conf.level)/2)/sqrt(length(AQIM.ts))
bacf <- acf(AQIM.ts, plot = FALSE)   # acf() here, matching the panel title
bacfdf <- with(bacf, data.frame(lag, acf))
q <- ggplot(data=bacfdf, mapping=aes(x=lag, y=acf)) +
geom_bar(stat = "identity", position = "identity") +
geom_hline(yintercept = c(ciline, -ciline), linetype = "dashed", colour = "blue")
q
```
### PACF Mumbai
```{r}
conf.level <- 0.95
ciline <- qnorm((1 - conf.level)/2)/sqrt(length(AQIM.ts))
bacf <- pacf(AQIM.ts, plot = FALSE)   # pacf() here, matching the panel title
bacfdf <- with(bacf, data.frame(lag, acf))
q <- ggplot(data=bacfdf, mapping=aes(x=lag, y=acf)) +
geom_bar(stat = "identity", position = "identity") +
geom_hline(yintercept = c(ciline, -ciline), linetype = "dashed", colour = "blue")
q
```
### ACF Chennai
```{r}
conf.level <- 0.95
ciline <- qnorm((1 - conf.level)/2)/sqrt(length(AQIC.ts))
bacf <- acf(AQIC.ts, plot = FALSE)   # acf() here, matching the panel title
bacfdf <- with(bacf, data.frame(lag, acf))
q <- ggplot(data=bacfdf, mapping=aes(x=lag, y=acf)) +
geom_bar(stat = "identity", position = "identity") +
geom_hline(yintercept = c(ciline, -ciline), linetype = "dashed", colour = "blue")
q
```
### PACF Chennai
```{r}
conf.level <- 0.95
ciline <- qnorm((1 - conf.level)/2)/sqrt(length(AQIC.ts))
bacf <- pacf(AQIC.ts, plot = FALSE)   # pacf() here, matching the panel title
bacfdf <- with(bacf, data.frame(lag, acf))
q <- ggplot(data=bacfdf, mapping=aes(x=lag, y=acf)) +
geom_bar(stat = "identity", position = "identity") +
geom_hline(yintercept = c(ciline, -ciline), linetype = "dashed", colour = "blue")
q
```
SARIMA {data-navmenu="Time Series Analysis"}
=========================================
Column
-------------------------------------
### Summary
Fitted Models
Delhi: SARIMA(1,1,2)(0,1,0)
Mumbai: SARIMA(1,0,0)(0,1,0)
Chennai: SARIMA(4,1,3)
The Chennai series has a large number of missing values and hence cannot be modelled properly, which explains the apparent lack of seasonality in its fitted model. We drop Chennai from the final analysis.
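The extent of the missingness can be checked directly before modelling. Below is a minimal, self-contained sketch in base R; the toy data frame and its 60%/20% missing shares are illustrative stand-ins for the real `city_day` data, not the project's actual numbers.

```r
# Illustrative only: measure the share of missing AQI readings per city.
toy <- data.frame(City = rep(c("Delhi", "Chennai"), each = 5),
                  AQI  = c(301, NA, 287, 310, 295, NA, NA, 88, NA, 92))
# na.pass keeps the NA rows so mean(is.na(v)) can count them
miss <- aggregate(AQI ~ City, data = toy,
                  FUN = function(v) mean(is.na(v)), na.action = na.pass)
miss  # toy Chennai: 0.6 missing, toy Delhi: 0.2 missing
```

A series with a large missing share forces `na.approx` to interpolate long stretches, which flattens any seasonal signal.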
### Delhi
```{r}
set.seed(1000)
ARIMAXD.out=auto.arima(y = AQID.ts)
forcastD<-forecast(object = ARIMAXD.out,h = 365)
x1d <-seq( from =as.Date("2020-01-08"),to = as.Date("2021-01-08"),length.out = 365)
y1d<-as.numeric(forcastD$mean)
figD <- plot_ly(x = ~x1d, y = ~y1d, mode = 'lines',type = "scatter")
plot(forcastD)
```
Column
-------------------------------------
### Mumbai
```{r}
set.seed(1000)
ARIMAXM.out=auto.arima(y = AQIM.ts)
forcastM<-forecast(object = ARIMAXM.out,h = 365)
plot(forcastM)
x1m <-seq( from =as.Date("2020-01-08"),to = as.Date("2021-01-08"),length.out = 365)
y1m<-as.numeric(forcastM$mean)
figM <- plot_ly(x = ~x1m, y = ~y1m, mode = 'lines',type = "scatter")
```
### Chennai
```{r}
set.seed(1000)
ARIMAXC.out=auto.arima(y = AQIC.ts)
forcastC<-forecast(object = ARIMAXC.out,h = 365)
plot(forcastC)
x1 <-seq( from =as.Date("2020-01-08"),to = as.Date("2021-01-08"),length.out = 365)
y1<-as.numeric(forcastC$mean)
fig <- plot_ly(x = ~x1, y = ~y1, mode = 'lines',type = "scatter")
```
Model Results SARIMA { data-navmenu="Time Series Analysis"}
===================================================================
### Importance {.tabset}
-------------------------------------
### Delhi
```{r}
print("Delhi")
summary(ARIMAXD.out)
```
### Mumbai
```{r}
print("Mumbai")
summary(ARIMAXM.out)
```
### Chennai
```{r}
print("Chennai")
summary(ARIMAXC.out)
```
SARIMAX {data-navmenu="Time Series Analysis"}
=========================================
Column
-------------------------------------
### Summary
Fitted Models
Delhi: SARIMAX(1,0,0)
Mumbai: SARIMAX(2,1,2)
Chennai: SARIMAX(1,1,1)
These ARIMAX models use all the meteorological variables as regressors. They yield a larger MAPE than the ARIMA models without an X variable/matrix, which can be attributed to the high variability of the meteorological series. A distributed-lag model was also fitted as a check. The ARIMA with no X variables remains the best in every case.
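The distributed-lag check can be sketched as follows. This is a toy illustration with simulated data, using base `arima()` with a fixed order for brevity (the project used `auto.arima`); the lag order of 2 and the series are assumptions, not the exact model fitted here.

```r
set.seed(2)
n <- 600
# toy exogenous series standing in for a meteorological variable
temp <- as.numeric(25 + arima.sim(list(ar = 0.8), n))
# AQI-like response depending on current and one-day-lagged temperature
aqi <- 100 + 2 * temp + c(0, head(temp, -1)) + rnorm(n, sd = 3)
# embed() builds the lag matrix: columns are x_t, x_{t-1}, x_{t-2}
xlag <- embed(temp, 3)
colnames(xlag) <- c("temp", "temp.l1", "temp.l2")
fit <- arima(aqi[3:n], order = c(1, 0, 0), xreg = xlag)
```

The response is trimmed to rows 3:n so it aligns with the lag matrix, which loses the first two observations.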
### Delhi
```{r}
datadelhi <- read.csv("datadelhi.csv")
datadelhi$Date<-as.Date(datadelhi$Date)
ydl=as.ts(AQID.ts[1:1500])
xdl=as.matrix(datadelhi[1:1500,-1])
ARIMAXD2.out=auto.arima(y = ydl, xreg = xdl)
forcastdlD<-forecast(object = ARIMAXD2.out,h = 365,xreg=as.matrix(datadelhi[1501:1865,-1]))
plot(forcastdlD)
```
Column
-------------------------------------
### Mumbai
```{r}
datamumbai <- read.csv("datamumbai.csv")
datamumbai$Date<-as.Date(datamumbai$Date)
ydl=as.ts(AQIM.ts[1:1500])
xdl=as.matrix(datamumbai[1:1500,-1])
ARIMAXM2.out=auto.arima(y = ydl, xreg = xdl)
forcastdlM<-forecast(object = ARIMAXM2.out,h = 365,xreg=as.matrix(datamumbai[1501:1865,-1]))
plot(forcastdlM)
```
### Chennai
```{r}
datachennai <- read.csv("datachennai.csv")
datachennai$Date<-as.Date(datachennai$Date)
ydl=as.ts(AQIC.ts[1:1500])
xdl=as.matrix(datachennai[1:1500,-1])
ARIMAXC2.out=auto.arima(y = ydl, xreg = xdl)
forcastdlC<-forecast(object = ARIMAXC2.out,h = 365,xreg=as.matrix(datachennai[1501:1865,-1]))
plot(forcastdlC)
```
Model Results SARIMAX { data-navmenu="Time Series Analysis"}
===================================================================
### Importance {.tabset}
-------------------------------------
### Delhi
```{r}
print("Delhi")
summary(ARIMAXD2.out)
```
### Mumbai
```{r}
print("Mumbai")
summary(ARIMAXM2.out)
```
### Chennai
```{r}
print("Chennai")
summary(ARIMAXC2.out)
```
Ljung-Box Residual Test {data-navmenu="Time Series Analysis"}
===================================================================
### Importance {.tabset}
-------------------------------------
### Delhi
```{r}
Box.test(ARIMAXD.out$residuals,lag=10,type="Ljung")
Box.test(x = ARIMAXD.out$residuals^2,lag=1,type="Ljung")
```
### Mumbai
```{r}
Box.test(ARIMAXM.out$residuals,lag=10,type="Ljung")
Box.test(x = ARIMAXM.out$residuals^2,lag=1,type="Ljung")
```
### Chennai
```{r}
Box.test(ARIMAXC.out$residuals,lag=10,type="Ljung")
Box.test(x = ARIMAXC.out$residuals^2,lag=1,type="Ljung")
```
Comparison & Results {data-navmenu="Time Series Analysis"}
===================================================================
Column
-------------------------------------
### Conclusion
Conclusion
The following models can be used to forecast future air quality in Delhi and Mumbai.
### Delhi Final Prediction
```{r}
figD
```
Column
-------------------------------------
### Limitations
Limitations
The Ljung-Box test confirmed conditional heteroscedasticity. GARCH models were fitted, but their standard deviations were too high; the data is by nature highly volatile. We have nevertheless obtained a reasonable prediction. Using Fourier terms for the seasonal component, or modern techniques such as LSTM neural networks, might give better results.
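The Fourier-term idea can be sketched like this: a self-contained toy in base R where an annual pattern of known amplitude 40 is approximated by K = 2 sine/cosine pairs and the errors follow an AR(1). K, the AR order, and the simulated amplitudes are assumptions for illustration only.

```r
set.seed(1)
n <- 3 * 365
t <- seq_len(n)
# simulated daily series with an annual cycle plus AR(1) noise
y <- 150 + 40 * sin(2 * pi * t / 365) +
  as.numeric(arima.sim(list(ar = 0.6), n, sd = 10))
# K = 2 pairs of Fourier terms stand in for the seasonal component
X <- cbind(S1 = sin(2 * pi * t / 365), C1 = cos(2 * pi * t / 365),
           S2 = sin(4 * pi * t / 365), C2 = cos(4 * pi * t / 365))
# regression with AR(1) errors replaces the [365] seasonal difference
fit <- arima(y, order = c(1, 0, 0), xreg = X)
```

`forecast::fourier()` generates the same regressors directly from a `ts` object and plugs into `auto.arima(..., xreg = ..., seasonal = FALSE)`.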
### Mumbai Final Prediction
```{r}
figM
```
Web Scraper
=========================================
```{r, eval=FALSE, echo=TRUE}
# This code was used to scrape meteorological data from the web
library(rvest)    # read_html(), html_nodes(), html_text()
library(stringr)  # str_remove(), str_extract()
monthname<-c('january','february','march','april','may','june','july','august','september','october','november','december')
for (year in 2015:2020)
{
data<-NULL  # reset so each data_<year> frame holds one year only
for(month in 1:12)
{
if (monthname[month] %in% c('january','march','may','july','august','october','december')){j=31}
else {j=30}
if(month==2&year%%4==0){j=29}
else if (month==2&year%%4!=0){j=28}
for(day in 1:j)
{
url<-read_html(paste0('https://en.tutiempo.net/records/vidp/',day,'-',monthname[month],'-',year,'.html'))
temp.raw<-url%>%
html_nodes('.Temp')%>%
html_text()
temp.raw<-gsub('[^\x20-\x7E]', '', temp.raw)  # strip non-ASCII degree symbols
humid.raw<-url%>%
html_nodes('.hr')%>%
html_text()
humid.raw<-str_remove(humid.raw,'%')
wind.raw<-url%>%
html_nodes('.wind')%>%
html_text()
wind.raw<-str_remove(wind.raw,' km/h')
wind.raw<-na.omit(str_extract(wind.raw, '[[:digit:]]+'))
pressure.raw<-url%>%
html_nodes('.prob')%>%
html_text()
pressure.raw<-str_remove(pressure.raw,'hPa')
# as.numeric (not as.integer) so decimal readings are not coerced to NA
humidity<-mean(na.omit(as.numeric(humid.raw)))
temp<-mean(na.omit(as.numeric(temp.raw)))
pressure<-mean(na.omit(as.numeric(pressure.raw)))
wind<-mean(na.omit(as.numeric(wind.raw)))
Date<-as.Date(paste0(year,'-',month,'-',day))
datatemp<-data.frame(Date,temp,pressure,humidity,wind)
data<-rbind(data,datatemp)
}
}
nametemp<-paste('data',year,sep='_')
assign(nametemp,data)
}
data2<-rbind(data_2015,data_2016,data_2017,data_2018,data_2019,data_2020)
write.csv(data2,'data2.csv', row.names = FALSE)
```
Credits
=========================================
Column
-------------------------------------
### Group No
Group No: 28
### Member
Wellington Daniel
2016IPM118
### Member
Oshin Singhal
2016IPM070
Column
-------------------------------------
### Section
Section: A
### Member
Shashank Giri
2016IPM094
### Member
Aditya Badonia
2016IPM004